📊 Real-Time Product Clickstream Analytics with Kafka, Spark & Airflow
🔍 Overview
This project simulates a real-time e-commerce clickstream analytics pipeline. It captures, processes, and visualizes product click data using a modern big data stack, demonstrating how real-time streaming and interactive dashboards power decision-making in scalable systems.
🧭 Approach
We built an end-to-end architecture: Kafka producers emit synthetic clickstream events, Spark Structured Streaming aggregates product views in 10-second windows, and Apache Airflow schedules the workflow. Aggregated results are written to both Parquet and CSV, then visualized in a Flask dashboard and in Tableau.
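A minimal producer sketch, assuming a local broker at `localhost:9092`, a topic named `clickstream`, and the `kafka-python` client; the product catalog and event fields are illustrative:

```python
import json
import random
import time
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client

# Broker address and topic name are assumptions for a local setup.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

PRODUCTS = ["P1001", "P1002", "P1003", "P1004"]  # illustrative catalog

while True:
    # Each event records which product was clicked and when.
    event = {
        "product_id": random.choice(PRODUCTS),
        "event_type": "click",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("clickstream", value=event)
    time.sleep(0.5)  # throttle the synthetic click rate
```

JSON values keep the payload easy to parse in the Spark consumer downstream.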
⚙️ Methodologies
- Kafka: Event streaming from a Python producer
- Spark: Real-time transformation and windowed aggregation (sketched after this list)
- Airflow: DAG-driven scheduling of Spark streaming jobs
- Flask + Plotly: Real-time visualization with auto-refresh
- Tableau: Deep analysis of batch outputs
- Storage: Parquet & CSV batch files
🧰 Technologies
- Languages: Python, HTML/CSS
- Libraries: PySpark, Pandas, Plotly, Flask
- Platforms: Apache Kafka, Apache Spark, Apache Airflow
- Storage: Parquet, CSV
- Visualization: Flask Dashboard, Tableau
💡 Key Learnings
- Developed production-style real-time data pipelines
- Orchestrated streaming jobs with Airflow DAGs (see the DAG sketch below)
- Created dashboards that auto-update from live Parquet files
- Exported clean CSVs for business tools like Tableau
- Handled Structured Streaming, watermarking, and fault tolerance
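A minimal DAG sketch for launching the streaming job, assuming it is packaged as a script run via `spark-submit`; the DAG id, schedule, and paths are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="clickstream_streaming",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # illustrative; the job itself runs continuously
    catchup=False,
) as dag:
    # Launch (or relaunch) the Spark Structured Streaming job.
    run_spark_streaming = BashOperator(
        task_id="run_spark_streaming",
        bash_command="spark-submit --master local[*] /opt/jobs/stream_aggregation.py",
    )
```

With a configured Spark connection, the `SparkSubmitOperator` from the Spark provider package is a drop-in alternative to the `BashOperator` call.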
📈 Results
The pipeline processed and aggregated clickstream data in real time. Visual dashboards displayed popular products and time-based trends with minimal latency. Tableau offered deep-dive insights into the batch exports, while Flask served as a live monitoring UI.
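A sketch of the auto-refreshing Flask view, assuming the Parquet output path and `count` column from the streaming sketch above; the 10-second meta refresh is an illustrative choice:

```python
import pandas as pd
import plotly.express as px
from flask import Flask

app = Flask(__name__)

# Spark's streaming output directory; path is an assumption (see sketch above).
PARQUET_PATH = "output/product_views"

@app.route("/")
def dashboard():
    # Re-read the live Parquet files on every request so the chart
    # always reflects the latest finalized windows.
    df = pd.read_parquet(PARQUET_PATH)
    top = df.groupby("product_id", as_index=False)["count"].sum()
    fig = px.bar(top, x="product_id", y="count", title="Product Views")
    chart = fig.to_html(full_html=False, include_plotlyjs="cdn")
    # A meta refresh gives a simple periodic auto-update without custom JS.
    return f'<html><head><meta http-equiv="refresh" content="10"></head><body>{chart}</body></html>'

if __name__ == "__main__":
    app.run(debug=True)
```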